Full image here of workflow is here: overview
scRNA seq is a very variable process and each dataset has unique QC problems. Though our simplified approach often works, we also need many more tools in our toolbox to handle more complex datasets. Often it is a case of trial and error to see what approaches work best for your data.
We are going to try out a few tools and approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.
Normalization: * Log normalization with scale factor = 10,000 * Find Variable features with vst, select top 2000 variable features
Make clusters: * Scale data with ScaleData() * Principle Component Analysis by using RunPCA() with npcs=30 PCs * Make non-linear dimensional reduction in UMAP by using RunUMAP() with dims=1:10 * Estimate Neighbors by using FindNeighbors() with dims=1:10 * Identify clusters with FindClusters() by using resolution=0.5
quick_clust <- function(seu) {
set.seed(42)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
return(seu)
}We will discuss a number of approaches and methods we may need to use to merge datasets.
We will be using a dataset containing 4 snRNAseq samples. This is from a preprint studying Alzheimers disease. Though this dataset has 10 samples, we will be using 4 for our example. There are 2 AD samples and 2 Control samples from the entorhinal cortex taht have been collected post-mortem.
You could access the raw data from GE0: GSE287652. But we have it here already loaded into a Seurat object and after some preprocessing.
Each dataset has been individually processed, QC’d and reviewed prior to combining into a list. Sotring everything in a list will make our life easier later.
We can use merge() function to integrate multiple data sets, as they are. We need to provide a single seurat object as our first argument, and a list containing the rest of our Seurat objects as the second. We also need to provide Group information and a name for the project as arguments.
seu_merge <- merge(my_seu_list[[1]], my_seu_list[2:4], add.cell.ids = c("AD2b", "AD4",
"C1", "C3"), project = "Merge")
head(seu_merge, 4)## orig.ident nCount_RNA nFeature_RNA percent.mt
## AD2b_AAACCCACAGCTGAAG-1_C1 SeuratProject 763 390 0.39318480
## AD2b_AAACGAAAGGTCGTAG-1_C1 SeuratProject 1030 501 0.00000000
## AD2b_AAACGAAGTCTTGAGT-1_C1 SeuratProject 3556 2272 0.61797753
## AD2b_AAACGAATCACCTCGT-1_C1 SeuratProject 3091 1863 0.03234153
## scrubletScore sample_id paper_cluster group
## AD2b_AAACCCACAGCTGAAG-1_C1 0.03906250 C1 0 C
## AD2b_AAACGAAAGGTCGTAG-1_C1 0.08771930 C1 0 C
## AD2b_AAACGAAGTCTTGAGT-1_C1 0.06688963 C1 12 C
## AD2b_AAACGAATCACCTCGT-1_C1 0.04857143 C1 29 C
## paper_annot
## AD2b_AAACCCACAGCTGAAG-1_C1 Excitatory Neurons
## AD2b_AAACGAAAGGTCGTAG-1_C1 Excitatory Neurons
## AD2b_AAACGAAGTCTTGAGT-1_C1 Inhibitory Neurons
## AD2b_AAACGAATCACCTCGT-1_C1 Inhibitory Neurons
Now that our data is merged we need to reprocess it. This ensures nromalization and scaling is done in a globally context. We can use our processing and clustering functions to analyse our merged dataset.
Our UMAP shows our cells are quite similar. Bu there are a few regions that are distinct depending on the condition.
Reciprocal PCA minimizes the batch effects while merging different data sets.
The steps involved in rPCA:
First we need to identify features for integration. We do this on our list of Seurat objects. This is similar to the VariableFeatures function we ran on a single dataset, but works across all 4 datasets. We will then run scaling and PCA, using these features.
We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.
anchors <- FindIntegrationAnchors(my_seu_list_rpca, anchor.features = feats, reduction = "rpca")
my_seu_merge_rpca <- IntegrateData(anchorset = anchors)
my_seu_merge_rpca## An object of class Seurat
## 28110 features across 10018 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 layer present: data
## 1 other assay present: RNA
To evaluate how well the merge has worked we must check the clustering. Again we must scale, and then use our quick_clust function.
To assess the integration we can use similar metrics to assessing the QC of the datasets in general.
We also check the overlap of our datasets. Generally we expect most cells between samples to overlap, butt this can be very experiment dependent.
We can see that our data sets overlay with each other. It looks slightly better than before. The dataset looked pretty good before, but now the overlap is tighter.
Our UMAP shows our cell clusters are fairly distinct, though not perfect (and the clustering itself could do with some optimization).
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now. To do this we make a heatmap and then scale it along the row to accunt for the different number of cells in each sample.
library(pheatmap)
tbl <- table(my_seu_merge_rpca$sample_id, my_seu_merge_rpca$seurat_clusters)
pheatmap(tbl, scale = "row")A given cell type should often be clustered together. This means marker genes should be specific. This oligodendrocyte marker has quite specific distribution.
This dataset also has some annotation in the paper_annot slot. We can check the distribution of the labels. Our UMAP shows our cell types are distinct.
Harmony works in a different way to rPCA.
Fuzzy clustering - It will assign clusters, but allow cells to belong to multiple clusters, and caluclate the strength of assingmetn to each cluster.
During clustering there is a penalty term, to maintain diversity of groups i.e. a cluster with just one sample is penalized. This means any structure due to a batch effect is not hard.
The centroid of each cluster is calculated for all cells and each individual sample, then a correction factor for each sample based on these centroids.
Cells are then moved based on the dataset correction factor, weighted by their individual assignment scores for each clsuter
Repeat iteratviely until convergence.
We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.
seu_merge <- merge(my_seu_list[[1]], my_seu_list[2:4], add.cell.ids = c("C1", "C3",
"AD2b", "AD4"), project = "Merge")
seu_merge <- data_proc(seu_merge)
seu_merge <- ScaleData(seu_merge)
seu_merge <- RunPCA(seu_merge)
seu_merge <- RunUMAP(seu_merge, reduction = "pca", dims = 1:10, reduction.name = "umap")As you can see we are back with our completely separate groups.
We can use the RunHarmony() function to implement the Harmony correction.
library(harmony)
seu_merge_harmony <- RunHarmony(seu_merge, group.by.vars = "sample_id", assay.use = "RNA")
seu_merge_harmony <- RunUMAP(seu_merge_harmony, reduction = "harmony", dims = 1:10,
reduction.name = "umap_harmony")
seu_merge_harmony <- FindNeighbors(seu_merge_harmony, reduction = "harmony")
seu_merge_harmony <- FindClusters(seu_merge_harmony)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10018
## Number of edges: 338251
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9148
## Number of communities: 21
## Elapsed time: 1 seconds
We can see that our data sets overlay with each other. Again, it looks slightly better than before. The dataset looked pretty good before, but now the overlap is tighter.
Our UMAP shows our cell clusters are fairly distinct, though not perfect (and the clustering itself could do with some optimization).
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now. To do this we make a heatmap and then scale it along the row to account for the different number of cells in each sample.
library(pheatmap)
tbl <- table(seu_merge_harmony$sample_id, seu_merge_harmony$seurat_clusters)
pheatmap(tbl, scale = "row")A given cell type should often be clustered together. This means marker genes should be specific. This oligodendrocyte marker has quite specific distribution.
This dataset also has some annotation in the paper_annot slot. We can check the distribution of the labels. Our UMAP shows our cell types are distinct.
Using heatmaps we can also check how specific each cluster is to each cell type.
tbl <- table(seu_merge_harmony$seurat_clusters, seu_merge_harmony$paper_annot)
pheatmap(tbl, scale = "row")Broadly it seems like rPCA may be the best option in this case. Why?:
Harmony
Works at the cluster level
Iterative nature pushes to convergence so can be a heavy handed
Allows for more complex model terms for batch correction
Fast and scalable
rPCA
Though Harmony and rPCA are our main workhorses for integration. There are many other alternatives too.
For more systematic comparison check out this paper.
As mentioned before this dataset has been manually annotated.
This was done by careful assessment of the marker genes. Here we are look at Oligodendrocyte markers.
To annotate the Single-cell data sets, we can evaluate the gene expression pattern of well known cell-type specific marker genes and make a manual annotation. This is time consuming and not systematic.
Here, we will introduce two more strategies to make cell type annotations automatically: 1. Mapping and annotating query datasets with Seurat using a reference data set. link 2. Make annotation with SingleR link
For this we need a reference dataset and a query dataset. Our annotation ends up being only as good as our reference dataset. There are many sources for reference data.
Depending on the format of the data, you may be set to go straightaway, need to do some light processing, or reanalyze the whole thing.
SingleR is Bioconductor package which can be used to predict annotations of a single-cell dataset. It is maybe the most flexible way of annotating your data, as it will accept a variety of kinds of reference data including bulk and scRNAseq experiments. SingleR works with a very simple method: calculate a spearman rank correlation between reference and test dataset for each label.
Despite this simplicity, there is also scope to do more complex annotation with some advacned features i.e. to improve resolution of related labels or using multiple references. You can dig into this further in the singleR book.
Lets start out by using the Human Primary Cell Atlas. This is a collection of microarray datasets from human primary cells that have been aggregated together. To access this data we will use the celldex package.
In this case the data is stored as a SummarizedExpriment; a specialized Bioconductor format designed for holding data matrices and their metadata. The reference data for SingleR can either be in this format or as a SingleCellExpriment (another Bioconductor format specific to single cell analysis).
## class: SummarizedExperiment
## dim: 19363 713
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
At the moment our test data is a Seurat object which is not Bioconductor friendly.
## An object of class Seurat
## 28110 features across 10018 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 2 layers present: data, scale.data
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
We can simply extract out a data matrix containing count information. This is good for SingleR.
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACCCACAGCTGAAG-1_C1 AAACGAAAGGTCGTAG-1_C1 AAACGAAGTCTTGAGT-1_C1
## ADAMTSL1 0.39228540 0.37254012 1.11566489
## THSD7B 0.02286489 0.05190584 -0.01465315
## IL1RAPL2 -0.01904459 -0.08152581 0.12641628
## RELN -0.45800974 -0.30815332 -0.64158447
## AC093607.1 0.06373820 0.06256471 0.01170564
## AAACGAATCACCTCGT-1_C1 AAACGAATCTCCGAAA-1_C1
## ADAMTSL1 -0.063393072 -0.05680296
## THSD7B 0.092400788 -0.02638186
## IL1RAPL2 -0.172235822 -0.00585319
## RELN 3.338650923 0.15192283
## AC093607.1 0.007405319 -0.03035472
We are almost ready to run SingleR. We have our test and reference data. The last part we need is labels. This can just be a vector containing all the labels. We can grab this easily from our reference dataset. We can see there are several options here. Let’s just stick to label.main.
## DataFrame with 713 rows and 3 columns
## label.main label.fine label.ont
## <character> <character> <character>
## GSM112490 DC DC:monocyte-derived:.. CL:0000840
## GSM112491 DC DC:monocyte-derived:.. CL:0000840
## GSM112540 DC DC:monocyte-derived:.. CL:0000840
## GSM112541 DC DC:monocyte-derived:.. CL:0000451
## GSM112661 DC DC:monocyte-derived:.. CL:0000451
## ... ... ... ...
## GSM556665 Monocyte Monocyte:S._typhimur.. CL:0000576
## GSM92231 Neurons Neurons:Schwann_cell CL:0002573
## GSM92232 Neurons Neurons:Schwann_cell CL:0002573
## GSM92233 Neurons Neurons:Schwann_cell CL:0002573
## GSM92234 Neurons Neurons:Schwann_cell CL:0002573
library(SingleR)
pred_res <- SingleR(ref = hpcad, test = my_seu_merge_rpca_mat, labels = hpcad$label.main)The score is generated comparing the expression levels of a cell in query dataset and the expression pattern of certain group (eg. cell types) in reference dataset. A cell would be assigned as the cell type which has highest score
## DataFrame with 2 rows and 4 columns
## scores labels
## <matrix> <character>
## AAACCCACAGCTGAAG-1_C1 0.0699139:0.0443145: 0.00782103:... Neurons
## AAACGAAAGGTCGTAG-1_C1 0.0605863:0.0201250:-0.00486433:... MSC
## delta.next pruned.labels
## <numeric> <character>
## AAACCCACAGCTGAAG-1_C1 0.01087697 Neurons
## AAACGAAAGGTCGTAG-1_C1 0.00253192 MSC
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE, fontsize_col = 5)We can now add our labels back to our original Seurat object by a quick assignment. This then allows us to start reviewing the annotation in the context of UMAPs and also versus our other annotation.
my_seu_merge_rpca$hpcad_singleR_labels <- pred_res$labels
summ_table <- table(my_seu_merge_rpca$hpcad_singleR_labels, my_seu_merge_rpca$paper_annot)
pheatmap(summ_table, scale = "column", fontsize_row = 5)Lets try an alternative dataset. Often we will have data from other Seurat objects that we want to use as a reference. Here we have a processed version of human data from the Allen Brain Map. This is 10X data from 2020.
There are several interesting labels associated with this data. Lets focus on the class_label.
## orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 SeuratProject 14396 4815
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 SeuratProject 12027 4372
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 SeuratProject 16112 5280
## AAACCCACACTACTTT-LKTX_190129_01_A01 SeuratProject 2994 1649
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 SeuratProject 5202 2499
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 SeuratProject 26504 6381
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 SeuratProject 21668 5470
## AAACCCATCATAAGGA-LKTX_190129_01_A01 SeuratProject 30141 7131
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 SeuratProject 31944 6864
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 SeuratProject 32476 7328
## percent.mt class_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 0 GABAergic
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACACTACTTT-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 0 Non-Neuronal
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCATCATAAGGA-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 0 Glutamatergic
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 0 Glutamatergic
## cluster_label subclass_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 Inh L1-2 SST CCNJL Sst
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 Exc L5-6 FEZF2 IFNG-AS1 L5/6 NP
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 Exc L3-5 RORB LINC01202 L5 IT
## AAACCCACACTACTTT-LKTX_190129_01_A01 Exc L2 LINC00507 GLRA3 L2/3 IT
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 Oligo L2-6 OPALIN FTH1P3 Oligo
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 Exc L5-6 FEZF2 C9orf135-AS1 L6 CT
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 Exc L3-5 FEZF2 ASGR2 L5 ET
## AAACCCATCATAAGGA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2 L5 IT
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2 L5 IT
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 Exc L5 THEMIS RGPD6 L6 CT
## cell_type_alias_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 Inh L1-2 SST CCNJL
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 Exc L5-6 FEZF2 IFNG-AS1
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 Exc L3-5 RORB LINC01202
## AAACCCACACTACTTT-LKTX_190129_01_A01 Exc L2 LINC00507 GLRA3
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 Oligo L2-6 OPALIN FTH1P3
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 Exc L5-6 FEZF2 C9orf135-AS1
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 Exc L3-5 FEZF2 ASGR2
## AAACCCATCATAAGGA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 Exc L5 THEMIS RGPD6
As our reference data is in a Seurat format we can just extract out the data matrix of counts. We also already have our matrix from our test data.
pred_res2 <- SingleR(ref = GetAssayData(abm), test = my_seu_merge_rpca_mat, labels = abm$class_label)## DataFrame with 2 rows and 4 columns
## scores labels delta.next
## <matrix> <character> <numeric>
## AAACCCACAGCTGAAG-1_C1 0.138210:0.142565:0.0505849 Glutamatergic 0.0050974
## AAACGAAAGGTCGTAG-1_C1 0.218222:0.253712:0.0876373 Glutamatergic 0.0347486
## pruned.labels
## <character>
## AAACCCACAGCTGAAG-1_C1 Glutamatergic
## AAACGAAAGGTCGTAG-1_C1 Glutamatergic
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res2$scores)
rownames(mat) <- rownames(pred_res2)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE, fontsize_col = 5)We can now add our labels back to our original Seurat object by a quick assignment. This then allows us to start reviewing the annotation in the context of UMAPs and also versus our other annotation.
summ_table <- table(my_seu_merge_rpca$abm_singleR_labels, my_seu_merge_rpca$paper_annot)
pheatmap(summ_table, scale = "column", fontsize_row = 5)